Agglomerative (hierarchical) clustering is a common way to find groups in data.
Hypothesis tests of the difference in means between such clusters have an inflated Type I error rate.
The cause is performing clustering and hypothesis testing on the same dataset (‘double dipping’).
Selective inference can control the Type I error rate.
A staple of modern statistics!
\[H_0: \textrm{Null hypothesis}\]
\[H_A: \textrm{Alternative hypothesis}\]
For example,
\(H_0:\) means of two clusters are indistinguishable (\(\mu_\mathcal{C_1} = \mu_\mathcal{C_2}\))
\(H_A:\) means of two clusters are different (\(\mu_\mathcal{C_1} \ne \mu_\mathcal{C_2}\))
Statistic \(T\) summarises the data and has a known distribution, e.g., \(T \sim \chi^2_N\).
Calculate the statistic for the observed data to obtain \(T_\textrm{obs}\).
Definition of p-value
Assuming the null hypothesis \(H_0\) is true, what is the probability of observing a statistic at least as extreme as \(T_\textrm{obs}\)?
\[p = \mathbb{P}_{H_0}(T \ge T_\textrm{obs})\]
A “small” p-value suggests that the data are incompatible with the assumption that \(H_0\) is true.
But how small is “small”?
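As a concrete sketch (in Python, with illustrative numbers that are not from the slides): if \(T \sim \chi^2_q\) under \(H_0\), the p-value is the upper-tail probability beyond \(T_\textrm{obs}\).

```python
# Sketch: computing p = P(T >= T_obs) for a chi-squared statistic.
# The values of q and T_obs here are made up for illustration.
from scipy.stats import chi2

q = 2          # degrees of freedom, e.g. number of features
T_obs = 7.38   # hypothetical observed statistic

p = chi2.sf(T_obs, df=q)   # survival function = upper tail under H0
```

`chi2.sf` is the survival function \(1 - F(T_\textrm{obs})\), which is exactly \(\mathbb{P}_{H_0}(T \ge T_\textrm{obs})\) for a continuous statistic.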
A Type I error is rejecting \(H_0\) when it is actually true.
\[\begin{array}{c|cc} & H_0 \textrm{ is True} & H_0 \textrm{ is False} \\ \hline \textrm{Reject }H_0 & \textbf{Type I error} & \textrm{True positive} \\ \textrm{Fail to reject }H_0 & \textrm{True negative} & \textrm{Type II error} \end{array}\]
In the long run, we can control our error rate to \(\alpha \in [0,1]\) by only rejecting \(H_0\) when \(p \lt \alpha\).
\(\alpha\) is decided by the investigator.
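A quick simulation (my own Python sketch, using a toy two-sample z-test with known variance) shows this long-run control: when \(H_0\) is true, p-values are uniform, so rejecting at \(p < \alpha\) gives a rejection rate close to \(\alpha\).

```python
# Under H0 the p-value is uniform on [0,1], so the long-run
# Type I error rate of the rule "reject when p < alpha" is alpha.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
alpha, n, sims = 0.05, 30, 20000

x = rng.normal(size=(sims, n))
y = rng.normal(size=(sims, n))            # same mean: H0 is true
z = (x.mean(axis=1) - y.mean(axis=1)) / np.sqrt(2 / n)
p = 2 * norm.sf(np.abs(z))                # two-sided p-values
type1 = (p < alpha).mean()                # close to alpha = 0.05
```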
A “bottom-up” method of hierarchical clustering:
cluster emmean SE df lower.CL upper.CL
1 0.0522 0.0737 97 -0.094 0.198
2 0.7704 0.0796 97 0.612 0.928
3 -0.7147 0.0737 97 -0.861 -0.568
Confidence level used: 0.95
contrast estimate SE df t.ratio p.value
cluster1 - cluster2 -0.718 0.108 97 -6.622 <.0001
cluster1 - cluster3 0.767 0.104 97 7.359 <.0001
cluster2 - cluster3 1.485 0.108 97 13.692 <.0001
P value adjustment: tukey method for comparing a family of 3 estimates
But!
\[ X_1 \sim N(0,1),\ X_2 \sim N(0,1) \qquad X_1 \perp X_2\]
Using the same data to generate a hypothesis and test that hypothesis.
e.g., cluster the data and then test the difference in means of those clusters.
cluster emmean SE df lower.CL upper.CL
1 -0.0327 0.0947 47 -0.223 0.158
2 0.7462 0.1114 47 0.522 0.970
3 -0.8094 0.0922 47 -0.995 -0.624
Confidence level used: 0.95
contrast estimate SE df t.ratio p.value
cluster1 - cluster2 -0.779 0.146 47 -5.326 <.0001
cluster1 - cluster3 0.777 0.132 47 5.877 <.0001
cluster2 - cluster3 1.556 0.145 47 10.756 <.0001
P value adjustment: tukey method for comparing a family of 3 estimates
There is still leakage of information.
We have a matrix normal distribution:
\[ \boldsymbol{X} \sim \mathcal{MN_{n \times q}}(\boldsymbol\mu, \boldsymbol I_n, \sigma^2\boldsymbol{I}_q)\]
where the rows of \(\boldsymbol\mu\) are mean vectors \(\mu_i\), each with \(q\) elements.
After applying clustering to \(\boldsymbol{x}\) we get the clusters \(\{\mathcal{\hat{C}}_1, \dots, \mathcal{\hat{C}}_K\}\).
The mean of any cluster \(\mathcal{\hat{C}_k}\) is:
\[\bar\mu_{\mathcal{\hat{C}_k}} = \frac{1}{|\mathcal{\hat{C}}_k|} \sum_{i \in \mathcal{\hat{C}}_k} \mu_i\]
We want to test
\[H_0: \bar\mu_{\mathcal{\hat{C}_{k}}} = \bar\mu_{\mathcal{\hat{C}_{k'}}}\]
Wald Test
\[p = \mathbb{P}\left( \chi^2_q \ \ge \ \kappa \| \bar{x}_{\hat{\mathcal{C}}_k} - \bar{x}_{\hat{\mathcal{C}}_{k'}} \|_2^2 \right)\]
where \(\kappa = \left(\sigma^2\left(\tfrac{1}{|\hat{\mathcal{C}}_k|} + \tfrac{1}{|\hat{\mathcal{C}}_{k'}|}\right)\right)^{-1}\) is the usual scaling for a known-variance difference of means, treating the clusters as if they had been fixed in advance.
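A small Python sketch of this naive test (my own helper, assuming known \(\sigma\) and the standard scaling \(\kappa = \left(\sigma^2(1/|\hat{\mathcal{C}}_k| + 1/|\hat{\mathcal{C}}_{k'}|)\right)^{-1}\), which the slide does not spell out):

```python
# Naive Wald p-value for a difference in cluster means, sigma known.
# Valid only if the two groups were fixed before seeing the data.
import numpy as np
from scipy.stats import chi2

def wald_p(xbar1, xbar2, n1, n2, sigma=1.0):
    xbar1, xbar2 = np.asarray(xbar1), np.asarray(xbar2)
    q = len(xbar1)                                  # number of features
    kappa = 1.0 / (sigma**2 * (1.0 / n1 + 1.0 / n2))
    stat = kappa * np.sum((xbar1 - xbar2) ** 2)     # kappa * ||diff||^2
    return chi2.sf(stat, df=q)                      # upper tail of chi^2_q
```

Applied to clusters estimated from the same data, this p-value is anti-conservative, which is exactly the problem the corrected test addresses.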
Corrected Test
\[p = \mathbb{P}_{H_0}\left( \| \bar{X}_{\hat{\mathcal{C}}_k} - \bar{X}_{\hat{\mathcal{C}}_{k'}} \|_2 \ \ge \ \| \bar{x}_{\hat{\mathcal{C}}_k} - \bar{x}_{\hat{\mathcal{C}}_{k'}} \|_2 \mid \textrm{Clustering results in }\hat{\mathcal{C}}_k, \hat{\mathcal{C}}_{k'} \right)\]
New p-value:
Of all the datasets that result in clusters \(\hat{\mathcal{C}}_k\) and \(\hat{\mathcal{C}}_{k'}\), what is the probability, assuming no difference in means, that we see such a large difference between the sample means \(\bar{x}_{\hat{\mathcal{C}}_k}\) and \(\bar{x}_{\hat{\mathcal{C}}_{k'}}\)?
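That conditional probability can be approximated by brute-force Monte Carlo in a toy setting (my own sketch, not the efficient truncated-distribution computation used in practice). Here the “clustering” splits sorted 1-D data at its largest gap; because iid null data are exchangeable, conditioning on the two index sets is equivalent to conditioning on the split size, which is what the acceptance step matches.

```python
# Monte Carlo sketch of a selective p-value with a toy clustering rule:
# split sorted 1-D data into two clusters at the largest gap.
import numpy as np

rng = np.random.default_rng(2)

def gap_split(x):
    """Split 1-D data into two clusters at the largest gap."""
    order = np.argsort(x)
    m = int(np.argmax(np.diff(x[order]))) + 1   # size of the lower cluster
    return order[:m], order[m:]

def stat(x, c1, c2):
    return abs(x[c1].mean() - x[c2].mean())

# 'Observed' data: a single N(0,1) sample, then double-dip.
x_obs = rng.normal(size=6)
c1, c2 = gap_split(x_obs)
t_obs = stat(x_obs, c1, c2)

# Among null datasets whose clustering matches the observed one
# (same split size, by exchangeability), how often is the statistic as large?
hits = total = 0
for _ in range(20000):
    x = rng.normal(size=6)
    d1, d2 = gap_split(x)
    if len(d1) == len(c1):          # clustering "results in" matching clusters
        total += 1
        hits += stat(x, d1, d2) >= t_obs
p_selective = hits / total
```

Unlike the naive test, the reference distribution here is built only from datasets on which the clustering procedure would have produced these clusters.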
cluster      1  2  3  4  5  6
Adelie      60 12  1  0  0  0
Chinstrap    4  2  0  0 27  1
Gentoo       0  0  0 57  1  0
Cluster 1 vs Cluster 2
$stat
[1] 10.41961
$pval
[1] 0.8667368
$trunc
Object of class Intervals
2 intervals over R:
(10.3122059802435, 16.3904453676166)
(131.758884280402, Inf)
Cluster 4 vs Cluster 5
$stat
[1] 18.86523
$pval
[1] 0.0004528178
$trunc
Object of class Intervals
2 intervals over R:
(16.8300111004978, 21.6868651391918)
(63.4548051430845, Inf)
We’re told to form hypotheses before looking at the data, but that’s not what happens in practice.
Double dipping inflates the Type I error rate.
Data splitting does not mitigate double dipping and is wasteful.
Instead, condition on the fact that the hypothesis was generated from the data.
Multiple testing is not always easy to detect.
Same issue appears in classification and regression trees, and changepoint detection.